Accelerated focused crawling through online relevance feedback
The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page with respect to an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier and reduce the number of irrelevant pages that are fetched and discarded.
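The frontier re-prioritization described above can be sketched as a relevance-ordered priority queue. This is a minimal illustration, not the paper's system: the class name, URLs, and scores are hypothetical, and the relevance predictor (learned from features of the HREF source page) is assumed to exist elsewhere and simply supply a number per URL.

```python
import heapq

class CrawlFrontier:
    """Priority queue of unvisited URLs, ordered by predicted relevance
    of the target page (estimated from the HREF source page).
    Higher-scoring URLs are fetched first."""

    def __init__(self):
        self._heap = []   # min-heap of (-score, insertion_order, url)
        self._order = 0   # tie-breaker keeps equal-score URLs in FIFO order
        self._seen = set()

    def add(self, url, predicted_relevance):
        if url in self._seen:       # never enqueue the same URL twice
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (-predicted_relevance, self._order, url))
        self._order += 1

    def next_url(self):
        """Pop the most promising unvisited URL, or None if exhausted."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url

frontier = CrawlFrontier()
frontier.add("http://example.com/sports", 0.2)
frontier.add("http://example.com/ml-paper", 0.9)
frontier.add("http://example.com/about", 0.5)
print(frontier.next_url())  # → "http://example.com/ml-paper"
```

Online relevance feedback would then update the predictor after each fetched page is judged relevant or not, changing the scores assigned to subsequently discovered links.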
Real-time augmented reality filters expressive of user sentiment
Body language and facial expressions are an important component of human communication. Some messaging applications include features to send emoji, animated GIFs, etc. to express emotion. However, such content does not include the user’s image. This disclosure describes techniques that enable users to choose augmented reality effects that are added to a user’s image and that help users express an emotion.
Enhanced classification through exploitation of hierarchical structures
Humans often organize information by encoding it in structures that link together entities such as concepts, objects, properties, etc. Among the various structures possible, hierarchies are commonly used. For instance, taxonomies of categories commonly employ hierarchies to indicate that one category “is a” type of another. The Yahoo! Web Directory and the Open Directory Project are two examples of large taxonomies where topics are hierarchically arranged. Hierarchies are also used to recursively decompose composite objects into their constituent parts. Examples of this are webpages that can be parsed and then represented as DOM trees, where the DOM nodes correspond to sections of the webpages.
In this thesis we argue that these hierarchical relationships between entities can be exploited to facilitate common data mining tasks defined upon them, such as automated classification. Specifically, we show that the information encoded in these hierarchies can be reduced to constraints on class membership scores, which can then be enforced as a post-processing step to enhance the accuracy of classification. We demonstrate our ideas and algorithms on three real-world tasks.
First, we tackle the problem of classification into hierarchical taxonomies. We show how different taxonomy structures can be translated into constraints on the outputs of classifiers learned at the nodes of the hierarchy. In addition, we give algorithms to optimally enforce these constraints and show that this results in improved classification accuracy. In cases where the taxonomies are not available, we give an approach to automatically derive hierarchical relationships amongst a flat set of categories. Next, we work on the problem of detecting noisy (templated) parts of webpages. We give algorithms that rate each section of a webpage in terms of how templated it is. Then we show that smoothing the output of these template classifiers over the DOM-tree hierarchy improves the template detection performance of our system. Finally, we investigate the task of segmenting websites into topically cohesive regions. We define a framework and, within it, a set of measures that characterize good segmentations, and give an efficient algorithm to find the best segmentation within this framework.
We formalize the problem of enforcing constraints on the outputs of classifiers as regularized isotonic or unimodal regression on rooted trees; these are generalizations of the classic isotonic regression problem. The nature of the constraints as well as the cost functions differs in each of the applications mentioned above. For all these formulations we give efficient algorithms to optimally smooth the classifier outputs. These novel formulations and algorithms may be of interest independent of the applications in this thesis.
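To illustrate what "enforcing hierarchy constraints as a post-processing step" can look like, here is a deliberately simple sketch: a top-down pass that clips each node's class-membership score so a child never exceeds its parent. This is only a heuristic projection under that one assumed constraint, not the optimal regularized isotonic-regression algorithm of the thesis, and the taxonomy and scores are hypothetical.

```python
def clip_down(tree, scores, root):
    """Enforce child <= parent on class-membership scores over a rooted
    tree by a single top-down clipping pass. Illustrative only: NOT the
    thesis's optimal regularized algorithm.
    `tree` maps node -> list of children; `scores` maps node -> float."""
    out = dict(scores)
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, []):
            out[child] = min(out[child], out[node])  # child can't exceed parent
            stack.append(child)
    return out

# Hypothetical taxonomy: Science -> {Physics, Biology}, Physics -> {Optics}.
tree = {"Science": ["Physics", "Biology"], "Physics": ["Optics"]}
scores = {"Science": 0.6, "Physics": 0.8, "Biology": 0.3, "Optics": 0.9}
smoothed = clip_down(tree, scores, "Science")
# Physics and Optics are pulled down to the Science score of 0.6.
```

The thesis's formulations instead choose the constrained scores that minimize a regularized cost relative to the raw classifier outputs, rather than clipping greedily.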
CLUMP: A scalable and robust framework for structure discovery
We introduce a robust and efficient framework called CLUMP (CLustering Using Multiple Prototypes) for unsupervised discovery of structure in data. CLUMP relies on finding multiple prototypes that summarize the data. Clustering the prototypes enables our algorithm to scale up to extremely large and high-dimensional domains such as text data. Other desirable properties include robustness to noise and parameter choices. In this paper, we describe the approach in detail, characterize its performance on a variety of datasets, and compare it to some existing model selection approaches.
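The two-stage idea — summarize the data with many prototypes, then cluster the prototypes — can be sketched as follows. This is a hedged stand-in, not the authors' exact CLUMP procedure: plain k-means supplies the prototypes, and a simple distance-threshold merge (connected components of the proximity graph) stands in for the prototype-clustering stage; the data and parameters are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means with deterministic strided initialization;
    returns the k centroids, which serve as prototypes of the data."""
    centers = X[:: max(1, len(X) // k)][:k].astype(float).copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def merge_prototypes(protos, threshold):
    """Stage 2 (stand-in): group prototypes whose pairwise distance is
    below `threshold` via union-find on the proximity graph."""
    parent = list(range(len(protos)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(protos)):
        for j in range(i + 1, len(protos)):
            if np.linalg.norm(protos[i] - protos[j]) < threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(protos))]

# Two well-separated square blobs of 25 points each: 6 prototypes
# summarize the 50 points, then merge back into the 2 true structures.
grid = np.array([[dx, dy] for dx in np.linspace(-0.3, 0.3, 5)
                          for dy in np.linspace(-0.3, 0.3, 5)])
X = np.vstack([grid, grid + 5.0])
protos = kmeans(X, k=6)
labels = merge_prototypes(protos, threshold=2.0)
print(len(set(labels)))  # → 2 discovered structures
```

The scalability claim rests on stage 2 operating on only k prototypes rather than all n points, so the quadratic pairwise step costs O(k²) instead of O(n²).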